co
We all know that air pollution is a factor in climate change. What we also know is that this is a global problem. But could it also be that the emissions of a country could already have a direct influence on the annual average temperature of that country? With data from the Community Emissions Data System (CEDS) and temperature data from FAOSTAT Temperature Change, this statement will be verified within a CRISP-DM analysis.
CEDS: https://github.com/JGCRI/CEDS
FAOSTAT: https://www.fao.org/faostat
Data Sources
CEDS-Data: https://ourworldindata.org/explorers/air-pollution
FAOSTAT-Temperature-Data: https://www.fao.org/faostat/en/#data/ET
# load emmisions data
df_emmisions = pd.read_csv(base_path+"/air-pollution_2019.csv", sep=',')
df_emmisions.sample(5)
| Nitrogen oxide (NOx) | Sulphur dioxide (SO₂) | Carbon monoxide (CO) | Organic carbon (OC) | NMVOCs | Black carbon (BC) | Ammonia (NH₃) | Entity | Year | |
|---|---|---|---|---|---|---|---|---|---|
| 15133 | 25.057906 | 3.620987e+01 | 1.499089e+04 | 850.820073 | 1.940318e+03 | 229.073486 | 7.248356e+02 | Gabon | 1883 |
| 33322 | 391.857270 | 1.758313e+02 | 1.004545e+05 | 4029.741103 | 9.482479e+03 | 1117.245142 | 6.164779e+03 | Peru | 1822 |
| 24939 | 150.218202 | 5.406977e+01 | 8.257835e+04 | 214.996717 | 3.767215e+03 | 20.608560 | 2.604822e+03 | Luxembourg | 1924 |
| 24602 | 61732.002532 | 6.799014e+04 | 1.634153e+07 | 802203.318679 | 1.882268e+06 | 214667.349197 | 1.033916e+06 | Lower-middle-income countries | 1812 |
| 7437 | 519926.777569 | 2.474886e+06 | 6.560237e+06 | 190202.789300 | 1.389955e+06 | 40464.242959 | 1.116856e+05 | Canada | 1937 |
df_emmisions.info()
print("shape:", df_emmisions.shape)
<class 'pandas.core.frame.DataFrame'> RangeIndex: 47535 entries, 0 to 47534 Data columns (total 9 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Nitrogen oxide (NOx) 47535 non-null float64 1 Sulphur dioxide (SO₂) 47535 non-null float64 2 Carbon monoxide (CO) 47535 non-null float64 3 Organic carbon (OC) 47535 non-null float64 4 NMVOCs 47535 non-null float64 5 Black carbon (BC) 47535 non-null float64 6 Ammonia (NH₃) 47535 non-null float64 7 Entity 47535 non-null object 8 Year 47535 non-null int64 dtypes: float64(7), int64(1), object(1) memory usage: 3.3+ MB shape: (47535, 9)
df_emmisions.describe()
| Nitrogen oxide (NOx) | Sulphur dioxide (SO₂) | Carbon monoxide (CO) | Organic carbon (OC) | NMVOCs | Black carbon (BC) | Ammonia (NH₃) | Year | |
|---|---|---|---|---|---|---|---|---|
| count | 4.753500e+04 | 4.753500e+04 | 4.753500e+04 | 4.753500e+04 | 4.753500e+04 | 4.753500e+04 | 4.753500e+04 | 47535.000000 |
| mean | 5.260436e+05 | 7.992584e+05 | 5.173733e+06 | 1.302651e+05 | 8.868047e+05 | 4.678637e+04 | 3.417240e+05 | 1909.436731 |
| std | 4.332972e+06 | 5.743331e+06 | 3.059404e+07 | 6.744725e+05 | 5.927514e+06 | 2.771701e+05 | 2.250404e+06 | 66.777467 |
| min | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 1750.000000 |
| 25% | 1.548467e+02 | 6.235476e+01 | 1.469565e+04 | 6.073753e+02 | 2.245105e+03 | 1.514704e+02 | 1.399557e+03 | 1853.000000 |
| 50% | 1.689596e+03 | 8.450006e+02 | 1.163538e+05 | 4.364754e+03 | 1.728262e+04 | 1.065555e+03 | 9.817929e+03 | 1912.000000 |
| 75% | 2.899340e+04 | 3.199098e+04 | 6.540220e+05 | 2.160645e+04 | 1.139918e+05 | 6.333218e+03 | 5.276752e+04 | 1968.000000 |
| max | 1.095663e+08 | 1.335263e+08 | 6.616236e+08 | 1.401013e+07 | 1.468331e+08 | 6.314146e+06 | 6.142059e+07 | 2019.000000 |
Nitrogen oxides (NOx) - gases consisting of nitrogen and oxygen, cause smog and acid rain, are emitted by vehicles and industrial processes.
Sulphur dioxide (SO2) - a colourless gas, causes acid rain, emitted from burning fossil fuels, especially coal.
Carbon monoxide (CO) - a colourless and odourless gas, toxic to humans, emitted from vehicles and the incomplete combustion of fossil fuels.
Organic carbon (OC) - a form of carbon, major component of particulate matter, emitted from vehicles and industrial processes.
NMVOCs (Non-Methane Volatile Organic Compounds) - a group of organic gases that contribute to the formation of smog and particulate matter and are emitted by industrial processes, vehicles and solvents.
Black carbon (BC) - a type of particulate matter composed mainly of carbon and produced by the incomplete combustion of fossil fuels, biofuels and biomass.
Ammonia (NH3) - colourless, strong-smelling gas that contributes to particulate matter and acid rain and is released by agricultural activities, especially livestock and fertilisers, industry and waste management.
(Hoesly et al, Historical (1750–2014) anthropogenic emissions of reactive gases and aerosols from the Community Emissions Data System (CEDS). Geosci. Model Dev. 11, 369-408, 2018a.)
sn.boxplot(data=df_emmisions, orient="h", palette="Set2")
<AxesSubplot:>
Lets use the zscore to see the scattering of the column values in the dataframe
from scipy.stats import zscore
def abs_zscore(x):
return np.abs(zscore(x))
df_numeric = df_emmisions.select_dtypes(include=np.number)
df_zscore = df_numeric.apply(abs_zscore)
threshold = 5
outliers = df_emmisions[(df_zscore>threshold).any(axis='columns')]
none_outliers = df_emmisions[~(df_zscore>threshold).any(axis='columns')]
def plot_outliers(outliers, none_outliers, number_of_cols, number_of_rows, figsize=(24, 10)):
fig, axes = plt.subplots(number_of_cols, number_of_rows, figsize=figsize)
axe = axes.ravel()
# assign the plot to each subplot in axe
for i, c in enumerate(outliers.columns):
outliers[[c, 'Year']].plot(kind='scatter', x=c, y='Year', ax=axe[i], color='red')
none_outliers[[c, 'Year']].plot(kind='scatter', x=c, y='Year', ax=axe[i])
if c == 'Entity':
axe[i].set(xticklabels=[])
plot_outliers(outliers, none_outliers, 3, 3, (24, 10))
# get outlier values of Entity
unique_entities = outliers['Entity'].unique()
unique_entities, outliers[['Carbon monoxide (CO)', 'Year', 'Entity']].describe()
(array(['Asia', 'China', 'Europe', 'High-income countries',
'Lower-middle-income countries', 'North America', 'United States',
'Upper-middle-income countries', 'World'], dtype=object),
Carbon monoxide (CO) Year
count 5.950000e+02 595.000000
mean 2.221171e+08 1964.816807
std 1.381388e+08 45.108978
min 6.614170e+07 1832.000000
25% 1.298270e+08 1947.000000
50% 1.819542e+08 1979.000000
75% 2.648116e+08 1996.000000
max 6.616236e+08 2019.000000)
As we can see, the dataset also contains calculations for complete continents and grouped countries.
We only want to research on a country level. Thats why these values can be removed.
none_countries = ['Asia', 'Africa', 'Europe', 'High-income countries',
'Lower-middle-income countries', 'North America', 'South America',
'Upper-middle-income countries', 'World']
filtered_outliers = outliers[~outliers['Entity'].isin(none_countries)]
filtered_outliers[['Carbon monoxide (CO)', 'Year', 'Entity']].describe()
| Carbon monoxide (CO) | Year | |
|---|---|---|
| count | 6.100000e+01 | 61.000000 |
| mean | 1.723337e+08 | 1992.540984 |
| std | 2.635135e+07 | 14.450575 |
| min | 1.063980e+08 | 1967.000000 |
| 25% | 1.533517e+08 | 1982.000000 |
| 50% | 1.718056e+08 | 1991.000000 |
| 75% | 1.964603e+08 | 2004.000000 |
| max | 2.155329e+08 | 2019.000000 |
import plotly
plotly.offline.init_notebook_mode()
plotly.express.choropleth(data_frame=df_emmisions[~df_emmisions['Entity'].isin(none_countries)],
locationmode='country names',locations='Entity', color='Carbon monoxide (CO)',animation_frame='Year',
title='Carbon emission by country and year',color_continuous_scale=plotly.express.colors.sequential.Plasma,range_color=(100, 0))
fig, axes = plt.subplots(1, 2, figsize=(22, 7))
df_emmisions[df_emmisions['Entity']=='Germany'].plot(x='Year', ylabel="Emissions in million tons", title='Emissions in Germany', ax=axes[0])
df_emmisions[df_emmisions['Entity']=='United States'].plot(x='Year', ylabel="Emissions in million tons", title='Emissions in the USA', ax=axes[1])
<AxesSubplot:title={'center':'Emissions in the USA'}, xlabel='Year', ylabel='Emissions in million tons'>
# load emmisions data
df_temperature = pd.read_csv(base_path+"/temperature_data_2019.csv", sep=',')
df_temperature.sample(5)
| Domain Code | Domain | Area Code (M49) | Area | Element Code | Element | Months Code | Months | Year Code | Year | Unit | Value | Flag | Flag Description | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 24194 | ET | Temperature change | 626 | Timor-Leste | 7271 | Temperature change | 7020 | Meteorological year | 1981 | 1981 | °C | 0.734 | E | Estimated value |
| 5311 | ET | Temperature change | 156 | China, mainland | 6078 | Standard Deviation | 7020 | Meteorological year | 1969 | 1969 | °C | 0.282 | E | Estimated value |
| 24584 | ET | Temperature change | 776 | Tonga | 7271 | Temperature change | 7020 | Meteorological year | 2005 | 2005 | °C | 0.540 | E | Estimated value |
| 22300 | ET | Temperature change | 702 | Singapore | 7271 | Temperature change | 7020 | Meteorological year | 1997 | 1997 | °C | NaN | O | Missing value |
| 11190 | ET | Temperature change | 332 | Haiti | 6078 | Standard Deviation | 7020 | Meteorological year | 1998 | 1998 | °C | 0.295 | E | Estimated value |
df_temperature.info()
print("shape:", df_temperature.shape)
<class 'pandas.core.frame.DataFrame'> RangeIndex: 27528 entries, 0 to 27527 Data columns (total 14 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Domain Code 27528 non-null object 1 Domain 27528 non-null object 2 Area Code (M49) 27528 non-null int64 3 Area 27528 non-null object 4 Element Code 27528 non-null int64 5 Element 27528 non-null object 6 Months Code 27528 non-null int64 7 Months 27528 non-null object 8 Year Code 27528 non-null int64 9 Year 27528 non-null int64 10 Unit 27528 non-null object 11 Value 25416 non-null float64 12 Flag 27528 non-null object 13 Flag Description 27528 non-null object dtypes: float64(1), int64(5), object(8) memory usage: 2.9+ MB shape: (27528, 14)
def temperature_df_preperation(df):
df = df.drop(df[df_temperature.Element == "Standard Deviation"].index)
df = df.rename(columns={'Value': 'Temperature Change'})
return df
df_temperature = temperature_df_preperation(df_temperature)
df_temperature[df_temperature['Area']=='Germany'].plot(x='Year', y="Temperature Change", ylabel="°C", figsize=(10, 5), title='Temperature Change in Germany')
<AxesSubplot:title={'center':'Temperature Change in Germany'}, xlabel='Year', ylabel='°C'>
df_temperature_numeric_only = df_temperature.select_dtypes(include=np.number)
df_temperature_zscore = df_temperature_numeric_only.apply(abs_zscore)
threshold = 5
outliers_temperature = df_temperature[(df_temperature_zscore>threshold).any(axis='columns')]
none_outliers_temperature = df_temperature[~(df_temperature_zscore>threshold).any(axis='columns')]
plot_outliers(outliers_temperature[["Year", "Temperature Change"]], none_outliers_temperature[["Year", "Temperature Change"]], 1, 2, (20, 6))
plotly.express.choropleth(data_frame=df_temperature, locationmode='country names', locations='Area',
color='Temperature Change', animation_frame='Year', title='Annual temperature change per country',
color_continuous_scale=plotly.express.colors.sequential.Plasma, range_color=(100, 0))
We need to clean columns and rows that are not intresting in both datasets.
Additionally we want to merge both dataframes for further analysis.
# check for null values in both datasets
print(df_emmisions.isnull().sum())
print(df_temperature[["Year", "Temperature Change", "Area"]].isnull().sum())
# drop null values
df_temperature = df_temperature.dropna()
Nitrogen oxide (NOx) 0 Sulphur dioxide (SO₂) 0 Carbon monoxide (CO) 0 Organic carbon (OC) 0 NMVOCs 0 Black carbon (BC) 0 Ammonia (NH₃) 0 Entity 0 Year 0 dtype: int64 Year 0 Temperature Change 516 Area 0 dtype: int64
We want to merge the two dataframes by Year->Year and Entity->Area unfortunaly it looks like not all Entity labels are the same as the Area labels.
In the following cell you can see the non-existing labels in the temperature dataframe, that are missing in the emissions dataframe. A "difflib" function was used to see the closest matches of missing labels. With this result the temperature dataframe was manually edited to use the same labels as the emission dataframe. But as you can see, not all labels could be found.
# compare Area and Entity in both data sets
import difflib
df_emmisions_countries = df_emmisions.Entity.unique()
df_temperature_countries = df_temperature.Area.unique()
diff_countries = set(df_temperature_countries).difference(set(df_emmisions_countries))
for country in diff_countries:
print(country, difflib.get_close_matches(country, df_emmisions_countries))
United Republic of Tanzania [] San Marino [] Isle of Man [] French Southern Territories [] Republic of Moldova [] Viet Nam ['Vietnam'] China, Taiwan Province of [] Anguilla ['Angola'] Democratic People's Republic of Korea ['Democratic Republic of Congo'] Channel Islands ['Cayman Islands', 'Marshall Islands', 'Faeroe Islands'] Russian Federation [] Saint Helena, Ascension and Tristan da Cunha [] Holy See [] Antarctica ['Africa'] Syrian Arab Republic ['Dominican Republic'] Côte d'Ivoire ["Cote d'Ivoire"] Serbia and Montenegro ['Montenegro'] Brunei Darussalam [] South Georgia and the South Sandwich Islands [] Norfolk Island ['Cook Islands', 'Faeroe Islands', 'Solomon Islands'] Micronesia (Federated States of) [] Pacific Islands Trust Territory [] Andorra ['Angola'] Yugoslav SFR [] Democratic Republic of the Congo ['Democratic Republic of Congo'] Timor-Leste ['Timor'] Türkiye ['Turkey'] United States of America ['United States', 'United States Virgin Islands'] Faroe Islands ['Faeroe Islands', 'Cook Islands', 'Marshall Islands'] United Kingdom of Great Britain and Northern Ireland [] Pitcairn [] Netherlands Antilles (former) [] China, Hong Kong SAR ['Hong Kong'] Iran (Islamic Republic of) [] Belgium-Luxembourg ['Luxembourg'] Nauru [] French Guyana ['French Guiana', 'French Polynesia', 'Guyana'] China, Macao SAR [] Wake Island ['Faeroe Islands', 'Cook Islands', 'Falkland Islands'] Cabo Verde ['Cape Verde'] Mayotte [] Midway Island ['Cayman Islands', 'Marshall Islands'] Czechoslovakia ['Czechia', 'Slovakia'] Réunion ['Reunion'] China, mainland ['Cayman Islands', 'Thailand'] Venezuela (Bolivarian Republic of) [] Cocos (Keeling) Islands ['Cook Islands'] Republic of Korea [] Sudan (former) [] Christmas Island ['Cayman Islands'] Svalbard and Jan Mayen Islands [] Ethiopia PDR ['Ethiopia'] Lao People's Democratic Republic ['Democratic Republic of Congo'] Palestine [] USSR [] Wallis and Futuna Islands ['Wallis and Futuna', 'Falkland Islands', 'Turks and Caicos Islands'] Bolivia (Plurinational State of) [] Tuvalu [] Falkland Islands (Malvinas) ['Falkland Islands'] Monaco ['Macao', 'Morocco']
# read in edited temperature dataframe
df_temperature = pd.read_csv(base_path+"/temperature_data_2019_country_corrected.csv", sep=',')
df_temperature = temperature_df_preperation(df_temperature)
df_temperature = df_temperature.dropna()
# merge data sets
new_df = pd.merge(df_emmisions, df_temperature, how='inner', left_on=['Entity','Year'], right_on = ['Area','Year'])
# remove unnecessary columns
new_merged_df = new_df[['Nitrogen oxide (NOx)', 'Sulphur dioxide (SO₂)', 'Carbon monoxide (CO)','Organic carbon (OC)', 'NMVOCs', 'Black carbon (BC)', 'Ammonia (NH₃)', 'Year', 'Temperature Change', 'Entity']]
new_df.columns, new_merged_df.columns
(Index(['Nitrogen oxide (NOx)', 'Sulphur dioxide (SO₂)', 'Carbon monoxide (CO)',
'Organic carbon (OC)', 'NMVOCs', 'Black carbon (BC)', 'Ammonia (NH₃)',
'Entity', 'Year', 'Domain Code', 'Domain', 'Area Code (M49)', 'Area',
'Element Code', 'Element', 'Months Code', 'Months', 'Year Code', 'Unit',
'Temperature Change', 'Flag', 'Flag Description'],
dtype='object'),
Index(['Nitrogen oxide (NOx)', 'Sulphur dioxide (SO₂)', 'Carbon monoxide (CO)',
'Organic carbon (OC)', 'NMVOCs', 'Black carbon (BC)', 'Ammonia (NH₃)',
'Year', 'Temperature Change', 'Entity'],
dtype='object'))
Let's create a simple linear correlation heatmap. Between the pollution values and the temperature change value. The data is currently not grouped by country.
plt.figure(figsize=(8, 6))
new_df_cols = new_df[['Nitrogen oxide (NOx)', 'Sulphur dioxide (SO₂)', 'Carbon monoxide (CO)','Organic carbon (OC)', 'NMVOCs', 'Black carbon (BC)', 'Ammonia (NH₃)', 'Year', 'Temperature Change', 'Entity']]
heatmap = sn.heatmap(new_df_cols.corr()[['Temperature Change']].sort_values(by='Temperature Change', ascending=False), vmin=-1, vmax=1, annot=True, cmap='BrBG')
heatmap.set_title('Features Correlating with temperature change', fontdict={'fontsize':18}, pad=16);
Let's try this correlation analysis with data that comes only from the country the emission were produced in.
# create a correlation heatmap for the individual countries
corrs_per_country = new_df_cols.groupby(['Entity']).corrwith(new_df_cols['Temperature Change'])
heatmap = sn.heatmap(corrs_per_country.iloc[65:73].sort_values(by='Temperature Change', ascending=False), vmin=-1, vmax=1, annot=True, cmap='BrBG')
# define final dataframe for data modelling
df_prepared = new_df[['Nitrogen oxide (NOx)', 'Sulphur dioxide (SO₂)', 'Carbon monoxide (CO)','Organic carbon (OC)', 'NMVOCs', 'Black carbon (BC)', 'Ammonia (NH₃)', 'Year', 'Entity', 'Temperature Change']]
# prepare a model (58 Years of data per country) and leave out 2019 for later evaluation
df_regression_germany = df_prepared[(df_prepared['Entity']=='Germany') & (df_prepared['Year']<2019)].sort_values(['Entity', 'Year'])
df_regression_all = df_prepared[(df_prepared['Year']<2019)].sort_values(['Entity', 'Year'])
df_evaluation_germany = df_prepared[(df_prepared['Entity']=='Germany') & (df_prepared['Year']>=2019)].sort_values(['Entity', 'Year'])
df_evaluation_all = df_prepared[(df_prepared['Year']>=2019)].sort_values(['Entity', 'Year'])
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error
scaler = MinMaxScaler()
df_regression_x = df_regression_germany[['Nitrogen oxide (NOx)', 'Sulphur dioxide (SO₂)', 'Carbon monoxide (CO)','Organic carbon (OC)', 'NMVOCs', 'Black carbon (BC)', 'Ammonia (NH₃)', 'Year']]
df_regression_y = df_regression_germany[['Temperature Change']]
X_train, X_test, y_train, y_test = train_test_split(df_regression_x, df_regression_y, test_size=0.2, random_state=42, shuffle=True)
# later used for evaluation
X_test_original = X_test.copy()
# min max scale data between 0 and 1
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
y_train = scaler.fit_transform(y_train)
y_test = scaler.transform(y_test)
Use a machine learning "Random Forest" model for regression.
Tune Hyperparameters with random cv search. (reaches gradient descent for best values faster)
Deep Learning - Bengio, Goodfellow, Courville
For the regression a simple RandomForestRegressor from the sklearn lib was used.
https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html
A Random Forest is a type of machine learning algorithm that builds multiple decision trees (hence "forest") and combines their predictions to classify new data points. Each decision tree is built using a random sample of the data and a random subset of the features, which helps to reduce overfitting and improve the overall accuracy of the model. The final prediction is made by taking a majority vote of the predictions from all the decision trees in the forest. It is called random because it randomly selects observations and features to build decision trees.
https://towardsdatascience.com/understanding-random-forest-58381e0602d2
# random search for RandomForestRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV
from pprint import pprint
# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 4, stop = 50, num = 4)]
# Number of features to consider at every split
max_features = ['auto', 'sqrt']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(2, 10, num = 2)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 4, 8]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]
# Method of selecting samples for training each tree
bootstrap = [True, False]
# Create the random grid
random_grid = {'n_estimators': n_estimators,
'max_features': max_features,
'max_depth': max_depth,
'min_samples_split': min_samples_split,
'min_samples_leaf': min_samples_leaf,
'bootstrap': bootstrap}
# First create the base model to tune
rf = RandomForestRegressor()
# Use the random grid to search for best hyperparameters
rf_random = RandomizedSearchCV(estimator = rf, param_distributions = random_grid, n_iter = 3, cv = 3, verbose=0, random_state=42, n_jobs = -1)
# Fit the random search model (ravel is used to flatten the y_train array)
rf_random.fit(X_train, y_train.ravel())
best_params = rf_random.best_params_
print(best_params)
{'n_estimators': 4, 'min_samples_split': 2, 'min_samples_leaf': 4, 'max_features': 'sqrt', 'max_depth': 10, 'bootstrap': False}
# RandomForrestRegressor
print(best_params)
#regressor = RandomForestRegressor(n_estimators=best_params['n_estimators'], min_samples_split=best_params['min_samples_split'], min_samples_leaf=best_params['min_samples_leaf'], max_features=best_params['max_features'], max_depth=best_params['max_depth'], bootstrap=best_params['bootstrap'])
# actual model that will be using for the evaluation
regressor = RandomForestRegressor(n_estimators=4, min_samples_split=2, min_samples_leaf=4, max_features="sqrt", max_depth=10, bootstrap=False)
regressor.fit(X_train, y_train.ravel())
{'n_estimators': 4, 'min_samples_split': 2, 'min_samples_leaf': 4, 'max_features': 'sqrt', 'max_depth': 10, 'bootstrap': False}
RandomForestRegressor(bootstrap=False, max_depth=10, max_features='sqrt',
min_samples_leaf=4, n_estimators=4)
feature_importances = pd.DataFrame(regressor.feature_importances_,
index = df_regression_x.columns,
columns=['importance']).sort_values('importance', ascending=False)
feature_importances.plot(kind='barh', figsize=(10, 5), title='Feature importance')
<AxesSubplot:title={'center':'Feature importance'}>
# predict values
y_pred = regressor.predict(X_test)
# rescale results and reshape to 2d array
y_pred = scaler.inverse_transform(y_pred.reshape(-1, 1))
y_test_rescaled = scaler.inverse_transform(y_test.reshape(-1, 1))
X_test_rescaled = scaler.inverse_transform(X_test.reshape(-1, 1))
print('Mean Squared Error:', mean_squared_error(y_test, y_pred))
# plot results
plt.figure(figsize=(10, 5))
plt.plot(y_test_rescaled, color = 'red', label = 'Real Temperature Change')
plt.plot(y_pred, color = 'blue', label = 'Predicted Temperature Change')
plt.title('Temperature Change Prediction')
plt.xlabel('Sample Count')
plt.ylabel('Temperature Change')
plt.legend()
plt.show()
# add Temperature Change Prediction to X_test_original
X_test_original['Temperature Change Prediction'] = y_pred
X_test_original['Temperature Change'] = y_test_rescaled
print(X_test_original[['Year', 'Temperature Change', 'Temperature Change Prediction']])
Mean Squared Error: 0.29113552186029507
Year Temperature Change Temperature Change Prediction 3730 1961 0.990 -0.214018 3735 1966 0.406 -0.324518 3764 1995 1.301 0.641887 3743 1974 0.318 0.223375 3774 2005 1.015 1.132256 3782 2013 0.571 1.353798 3767 1998 1.146 0.994038 3755 1986 -0.011 0.276950 3776 2007 2.080 1.132256 3742 1973 0.188 -0.234167 3778 2009 1.270 1.255381 3733 1964 -0.330 -0.324518
def model_by_country(df):
df_regression_x = df[['Nitrogen oxide (NOx)', 'Sulphur dioxide (SO₂)', 'Carbon monoxide (CO)','Organic carbon (OC)', 'NMVOCs', 'Black carbon (BC)', 'Ammonia (NH₃)', 'Year']]
df_regression_y = df[['Temperature Change']]
X_train, X_test, y_train, y_test = train_test_split(df_regression_x, df_regression_y, test_size=0.2, random_state=42, shuffle=True)
regressor = RandomForestRegressor(n_estimators=4, min_samples_split=2, min_samples_leaf=4, max_features="sqrt", max_depth=10, bootstrap=False)
return regressor.fit(X_train, y_train.values.ravel())
model = df_prepared.sort_values(['Entity', 'Year']).groupby(['Entity']).apply(model_by_country)
model = model.reset_index()
model = model.rename(columns={0: 'Temperature Change Model'})
results_2019 = []
for index, row in df_evaluation_all.iterrows():
row_x = [row['Nitrogen oxide (NOx)'], row['Sulphur dioxide (SO₂)'], row['Carbon monoxide (CO)'], row['Organic carbon (OC)'], row['NMVOCs'], row['Black carbon (BC)'], row['Ammonia (NH₃)'], row['Year']]
row_y = row['Temperature Change']
model_country = model[(model['Entity']==row['Entity'])]
prediction = model_country.iloc[0]["Temperature Change Model"].predict([row_x])
difference = np.abs(prediction - row_y)
error = difference / prediction
mean_squared_error = np.mean(error)
results_2019.append([row['Entity'], row['Year'], prediction[0], row_y, mean_squared_error])
# results_2019 to dataframe
df_results_2019 = pd.DataFrame(results_2019, columns=['Entity', 'Year', 'Prediction', 'Real', 'Mean Squared Error'])
# df_results_2019 avg for Mean Squared Error
print("Error across all countries:",df_results_2019['Mean Squared Error'].mean())
Error across all countries: 0.16168220258697916
# plot results of 2019 seaborn
plt.figure(figsize=(10, 5))
sn.set_style("whitegrid")
sn.barplot(x="Entity", y="Mean Squared Error", data=df_results_2019[68:78])
plt.xticks(rotation=90)
plt.title('Mean Squared Error of 2019')
plt.xlabel('Country')
plt.ylabel('Mean Squared Error')
plt.show()
So what does this result mean for our original question "We see that the original question "Does the air pollution of a country have a direct influence on the annual average temperature?" ?
We can see that it is possible for some countries better for others worse to use the emission values of the current year to predict the temperature change of the country. That means emissions or air pollution can or have a effect on our climate/temperature.
Now we would like to make our country based prediction models available for scientist. So that they can use measured emission values of their country to predict the temperature change for a year.
Therefore we will use flask a simple python web framework to make this possible.
https://flask.palletsprojects.com/en/2.2.x/